The ed-tree: An Index for Large DNA Sequence Databases

Authors

  • Zhenqiang Tan
  • Xia Cao
  • Beng Chin Ooi
  • Anthony K. H. Tung
Abstract

The growing interest in genomic research has caused explosive growth in the size of DNA databases, making it increasingly challenging to perform searches on them. In this paper, we propose an index structure called the ed-tree for supporting fast and effective homology searches on DNA databases. The ed-tree is developed to enable probe-based homology search algorithms like Blastn, which generate short probe strings from the query sequence and then match them against the sequence database in order to identify potential regions of high similarity to the query sequence. Unlike Blastn, however, the homology search algorithm we developed for the ed-tree supports a more flexible probe model with longer probes and more relaxed matching. As a consequence, the ed-tree is not only more effective and efficient than the latest Blastn (NCBI Blast2) when supporting homology search but also takes up moderate storage compared to existing data structures like the suffix tree. To index a DNA database of 2 giga base pairs (Gbps), the ed-tree takes less than 3Gb of secondary storage, which is easily handled by a desktop PC. Experiments are presented in this paper to support our claim.
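
As a rough illustration of the probe-based search idea described in the abstract, the following minimal sketch generates fixed-length probes from a query and matches them against a database sequence. It is not the ed-tree: it assumes a plain in-memory hash index with exact probe matching, whereas the abstract describes longer probes and more relaxed matching. The function names, the probe length `k`, and the `min_hits` threshold are all hypothetical choices made for this example.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map every length-k substring of `sequence` to its start positions.

    Assumption: a simple hash index stands in for a real on-disk index.
    """
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def probe_hits(query, index, k):
    """Return (query_pos, db_pos) pairs where a query probe matches exactly."""
    hits = []
    for qpos in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict
        for dpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, dpos))
    return hits


def candidate_diagonals(hits, min_hits):
    """Group hits by diagonal (db_pos - query_pos) and keep the diagonals
    supported by at least `min_hits` probes; these mark candidate regions."""
    counts = defaultdict(int)
    for qpos, dpos in hits:
        counts[dpos - qpos] += 1
    return {diag: n for diag, n in counts.items() if n >= min_hits}


# Toy data, invented for illustration only.
database = "TTGACCGTAGGCTAACGT"
query = "CCGTAGGC"

index = build_kmer_index(database, k=4)
hits = probe_hits(query, index, k=4)
regions = candidate_diagonals(hits, min_hits=3)
print(regions)
```

Diagonals that collect several probe hits point to database regions worth aligning in full; a real system would follow this filtering step with an alignment or extension step on each candidate region.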

Similar resources

An efficient approach for sequence matching in large DNA databases

In molecular biology, DNA sequence matching is one of the most crucial operations. Since DNA databases contain a huge volume of sequences, fast indexes are essential for efficient processing of DNA sequence matching. In this paper, we first point out the problems of the suffix tree, an index structure widely-used for DNA sequence matching, in respect of storage overhead, search performance, and...

Adapting Decision Tree-Based Method to Index Large DNA-Protein Sequence Datasets

Currently, the size of biological databases has increased significantly with the growing number of users and the rate of queries, and some databases are of terabyte size. Hence, there is an increasing need to access databases at the fastest possible rate. Where biologists are concerned, the need is more for a means of fast, scalable and accurate searching in biological databases. This may seem ...

Accelerating Approximate Subsequence Search on Large Protein Sequence Databases

Bioinformatics has become an active research area in recent years. The amount of mapped sequences doubles every fourteen months. BLAST has been widely employed for retrieving sequences which have portion(s) similar to a given sequence. However, BLAST has to scan the entire database every time a query is issued. This can be very time-consuming, especially when the database is large. In this p...

The Hybrid Digital Tree and Its Applications to Genomic Sequence Databases

By Qiang Xue. This dissertation focuses on index structures, search algorithms, and applications for large string databases whose indexes cannot fit entirely in the main memory (RAM). String searching is a classic research topic that has received increasing attention in recent years, due to the rapid growth of digital tex...

SST: An algorithm for searching sequence databases in time proportional to the logarithm of the database size

We have developed an algorithm, called SST (Sequence Search Tree), that searches a database of DNA sequences for near exact matches, in time proportional to the logarithm of the database size n. In SST, we partition each sequence into fragments of fixed length called "windows" using multiple offsets. Each window is mapped into a vector of dimension 4 which contains the frequency of occurrence of i...

Publication date: 2003